A Semi-Fast Fourier Transform Algorithm over GF(2^m)

by 
Abstract


An algorithm which computes the Fourier Transform of a sequence of length
n over $GF(2^m)$ using approximately $2nm$ multiplications and $n^2 + nm$
additions is developed. The number of multiplications is thus considerably
smaller than the $n^2$ multiplications required for a direct evaluation,
though the number of additions is somewhat larger. Unlike the Fast Fourier
Transform, this method does not depend on the factors of n and can be used
when n is not highly composite or is a prime. The bit complexity of the
algorithm is analyzed in detail. Implementations and applications are
briefly discussed.


The  author  is  with  the  Coordinated  Science  Laboratory  and  the  Department  of 
Electrical  Engineering,  University  of  Illinois  at  Urbana-Champaign,  Urbana, 
Illinois  61801. 

This  work  was  supported  in  part  by  the  Joint  Services  Electronics  Program 
(U.S.  Army,  U.S.  Navy  and  U.S.  Air  Force)  under  Contract  DAAB-07-72-C-0259. 


I.  Introduction 

The Fast Fourier Transform (FFT) algorithm of length n over a finite
field [1] is essentially the well-known complex field FFT algorithm (e.g. [2])
with the primitive n-th root of unity $\exp(j2\pi/n)$ in the complex field
being replaced by a primitive n-th root of unity in the finite field (or an
extension thereof). When n is composite, with factors $n_1, n_2, \ldots, n_s$,
the finite field FFT is essentially what is called a mixed-radix FFT and
requires $n(n_1 + n_2 + \cdots + n_s)$ multiplications and
$n(n_1 + n_2 + \cdots + n_s)$ additions as compared to the $n^2$
multiplications and $n^2$ additions required to evaluate the Fourier
Transform in the most obvious way. If n is not highly composite (or if n is
a prime), the saving in computation is quite small (or nonexistent). In such
cases, the Fourier Transform can be computed from the cyclic convolution of
two appropriately defined sequences of length approximately 2n or more
[3]-[5]. This convolution itself can be computed by computing the forward
transforms of the two sequences, a pointwise multiplication of the
transforms, and an inverse transform. If the length of the sequences is
chosen to be highly composite, the FFT algorithm can be used to compute the
three transforms and significant savings in computation can be achieved.

For small to moderate values of n, however, the cyclic convolution
technique can be slower than direct evaluation. One other problem that arises
with finite fields is that the finite field may not contain an appropriate
primitive root of unity, and computations may have to be done in a much larger
field. For example, if we wish to compute the transform of a sequence of 31
elements from $GF(2^5)$, we can find it from the cyclic convolution of two
sequences of length (say) 63. Unfortunately, the smallest field that contains
$GF(2^5)$ as well as a primitive 63rd root of unity is $GF(2^{30})$ [6], [7].
Of course, one might cyclically convolve two sequences of length 93, which
will allow the use of $GF(2^{10})$. On the other hand, 93 is not highly
composite. In fact, computing three FFTs of length 93 (or 63) requires more
computation in a larger field than a brute-force evaluation of the original
transform.
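
The field sizes quoted in this example follow from the multiplicative order of 2 modulo the convolution length. The following is a minimal Python sketch of that arithmetic; the function names are illustrative and do not come from the paper.

```python
from math import gcd

def order_of_2(n):
    """Multiplicative order of 2 modulo an odd integer n > 1."""
    k, x = 1, 2 % n
    while x != 1:
        x = (2 * x) % n
        k += 1
    return k

def smallest_field_degree(base_degree, length):
    """Degree r of the smallest GF(2^r) that contains GF(2^base_degree)
    together with a primitive root of unity of the given odd order."""
    r = order_of_2(length)                            # GF(2^r) has a root of that order
    return base_degree * r // gcd(base_degree, r)     # least common multiple

print(smallest_field_degree(5, 63))   # 30, i.e. GF(2^30)
print(smallest_field_degree(5, 93))   # 10, i.e. GF(2^10)
```

The order of 2 modulo 63 is 6, and lcm(5, 6) = 30; the order of 2 modulo 93 is 10, which is already a multiple of 5.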

The  situation  is  somewhat  better  if  one  redefines  the  problem  so  that 
computation  can  be  done  in  GF(p)  for  some  large  prime  p or  in  the  complex 
field  itself.  However,  these  techniques  will  not  be  analyzed  in  this  paper. 

The algorithm proposed in this paper requires $2(n-1)(m-1)$
multiplications and $(n-1)(n+m-1)$ additions in $GF(2^m)$ to compute a
transform of length n, n a prime, over $GF(2^m)$. If n is not a prime, the
number of multiplications is somewhat less and the number of additions is
somewhat more. Since multiplications require more time than additions, the
algorithm is somewhat faster than the direct method, though both have
arithmetic complexity of the same order $O(n^2)$. The bit complexity of the
proposed algorithm is $O(n^2 \log n)$, which is better by a factor of
$\log n$ than that of the direct method. For small values of n, the proposed
method is also superior to the cyclic convolution technique. Asymptotically,
of course, the cyclic convolution technique using the FFT has arithmetic
complexity $O(n \log n)$ and bit complexity $O(n \log^3 n)$, and is vastly
superior. For these reasons, the proposed algorithm is dubbed a Semi-Fast
Fourier Transform (S-FFT) algorithm.
II. The S-FFT Algorithm

Let n be an odd integer,
    m the multiplicative order of 2 modulo n,
    $\alpha$ a primitive n-th root of unity in $GF(2^m)$,
    $\beta$ an element of degree m in $GF(2^m)$. It is convenient, but not
    necessary, to take $\beta$ to be a primitive element. In fact, one
    can take $\beta = \alpha$.

Let $A(x) = \sum_{i=0}^{n-1} A_i x^i$ be a polynomial over $GF(2^m)$.
Every element of $GF(2^m)$ can be represented as a polynomial of degree
less than m in $\beta$ (see, e.g. [7]). Thus,


$$A_i = \sum_{k=0}^{m-1} A_{i,k}\,\beta^k, \qquad \text{where } A_{i,k} \in GF(2). \tag{1}$$


Hence,

$$A(x) = A^{(0)}(x) + \beta A^{(1)}(x) + \beta^2 A^{(2)}(x) + \cdots + \beta^{m-1} A^{(m-1)}(x) \tag{2}$$

where $A^{(k)}(x) = \sum_{i=0}^{n-1} A_{i,k}\, x^i$ is a polynomial over GF(2).


The Fourier Transform of A(x) is $B(z) = \sum_{j=0}^{n-1} B_j z^j$ where

$$B_j = A(\alpha^j) = \sum_{i=0}^{n-1} A_i\,(\alpha^j)^i. \tag{3}$$


Using (1), it is easy to manipulate (3) to give

$$B_j = \sum_{k=0}^{m-1} \bigl[A^{(k)}(\alpha^j)\bigr]\,\beta^k. \tag{4}$$


If $B^{(k)}(z)$ denotes the Fourier Transform of $A^{(k)}(x)$, then, from
(2) and (4), we get

$$B(z) = B^{(0)}(z) + \beta B^{(1)}(z) + \beta^2 B^{(2)}(z) + \cdots + \beta^{m-1} B^{(m-1)}(z). \tag{5}$$


The basic idea behind the algorithm is to find the polynomials
$B^{(k)}(z)$ as efficiently as possible and then use (5) to compute B(z). It
is easier to compute $B_0$ directly from (3) using (n-1) additions rather
than from (5). The rest of the coefficients of B(z) can be computed from the
coefficients of the $B^{(k)}(z)$'s in $(n-1)(m-1)$ multiplications and
$(n-1)(m-1)$ additions using (5). We now show that the coefficients of the
$B^{(k)}(z)$'s can be computed quite rapidly because the $B^{(k)}(z)$'s are
transforms of binary polynomials.


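We have

$$\bigl[B_j^{(k)}\bigr]^2 = \Bigl[\sum_{i=0}^{n-1} A_{i,k}\,\alpha^{ij}\Bigr]^2 = \sum_{i=0}^{n-1} A_{i,k}^2\,\alpha^{2ij} \qquad \text{(in a field of characteristic 2)}$$

$$= \sum_{i=0}^{n-1} A_{i,k}\,\alpha^{2ij} \qquad \text{(since } A_{i,k} \in GF(2)\text{)}.$$

Thus,

$$\bigl[B_j^{(k)}\bigr]^2 = B_{2j}^{(k)}. \tag{6}$$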


Given $B_j^{(k)}$, one can compute $B_{2j}^{(k)}$, $B_{4j}^{(k)}$,
$B_{8j}^{(k)}$, ..., etc. (subscripts taken modulo n) simply by successive
squarings. Note that $\alpha^j, \alpha^{2j}, \alpha^{4j}, \ldots$, etc. are
conjugate elements in $GF(2^m)$ and are the roots of the same irreducible
polynomial. Let I(n) denote the number of such irreducible polynomials (of
degree greater than 1) that are divisors of $x^n - 1$. If we compute
$B_j^{(k)}$ for I(n) values of j, then all the other $B_j^{(k)}$ (except
$B_0^{(k)}$, which need not be computed at all) can be obtained by squaring.
Now, to compute $B_j^{(k)}$ requires at most (n-1) additions because the
$A_{i,k}$ are 0 or 1 and, hence, either $(\alpha^j)^i$ is added to the sum or
it is not. Thus, we can compute all the coefficients of $B^{(k)}(z)$ except
$B_0^{(k)}$ in $(n-1)I(n)$ additions and $(n-1) - I(n)$ squaring operations
(i.e. multiplications) in $GF(2^m)$. There are m such polynomials
$B^{(k)}(z)$ and thus, we find that all the coefficients of B(z) can be
computed using a total of $m(n-1)(I(n)+1)$ additions and
$m(n-1-I(n)) + (n-1)(m-1)$ multiplications over $GF(2^m)$.
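
The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: field elements are stored as integers whose bit k is the coefficient of $\beta^k$, $\beta$ is taken to be the root x of the field polynomial `poly` (so `beta = 2`), and all function names are hypothetical. The result is checked against a direct evaluation of (3).

```python
def gf_mul(a, b, poly, m):
    """Product of a and b in GF(2^m); bit k of an element is its beta^k coefficient."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly                      # reduce modulo the field polynomial
    return r

def gf_pow(a, e, poly, m):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a, poly, m)
    return r

def direct_transform(A, alpha, poly, m):
    """B_j = sum_i A_i (alpha^j)^i, evaluated term by term as in (3)."""
    n, B = len(A), []
    for j in range(n):
        aj, x, acc = gf_pow(alpha, j, poly, m), 1, 0
        for i in range(n):
            acc ^= gf_mul(A[i], x, poly, m)
            x = gf_mul(x, aj, poly, m)
        B.append(acc)
    return B

def sfft(A, alpha, poly, m):
    """Semi-fast transform: bit-slice sums, squaring as in (6), recombination as in (5)."""
    n, beta = len(A), 2
    B = [0] * n
    for a in A:                            # B_0 is the plain sum of the inputs
        B[0] ^= a
    slices = [[None] * n for _ in range(m)]    # slices[k][j] holds B_j^(k)
    for k in range(m):
        for j in range(1, n):
            if slices[k][j] is not None:
                continue                   # already obtained by squaring
            aj, x, acc = gf_pow(alpha, j, poly, m), 1, 0
            for i in range(n):
                if (A[i] >> k) & 1:        # A_{i,k} is 0 or 1: add alpha^(ij) or not
                    acc ^= x
                x = gf_mul(x, aj, poly, m)
            jj, v = j, acc
            while slices[k][jj] is None:   # fill B_2j, B_4j, ... by successive squaring
                slices[k][jj] = v
                jj = (2 * jj) % n
                v = gf_mul(v, v, poly, m)
    for j in range(1, n):                  # Horner's rule in beta
        acc = 0
        for k in reversed(range(m)):
            acc = gf_mul(acc, beta, poly, m) ^ slices[k][j]
        B[j] = acc
    return B

poly5 = 0b100101                           # x^5 + x^2 + 1, an assumed field polynomial
A = [(7 * i + 3) % 32 for i in range(31)]  # arbitrary test data in GF(2^5)
print(sfft(A, 2, poly5, 5) == direct_transform(A, 2, poly5, 5))
```

For clarity the sketch recomputes the powers of $\alpha^j$ with multiplications; the operation counts in the text assume those powers are available without extra multiplications.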

Let $\phi(\cdot)$ denote Euler's $\phi$ function and let $\Psi(d)$ denote the
multiplicative order of 2 mod d. Thus, $\Psi(d)$ divides $\Psi(n) = m$ if d
divides n. Then we have [8]

$$I(n) = \sum_{d \mid n,\ d > 1} \frac{\phi(d)}{\Psi(d)}. \tag{7}$$

If n is a prime, then we get $I(n) = (n-1)/m$. In this case, B(z) can be
computed using $2(n-1)(m-1)$ multiplications and $(n-1)(n+m-1)$ additions. If
n is not a prime, then, using the fact that $\sum_{d \mid n} \phi(d) = n$, we
get that $I(n) \ge (n-1)/m$ and thus the number of multiplications required
is somewhat less than $2(n-1)(m-1)$ while the number of additions is
increased. If $n = 2^m - 1$, (7) can also be written [8] as

$$I(n) = \frac{1}{m}\Bigl[\sum_{d \mid m} \phi(d)\,2^{m/d}\Bigr] - 2 \;\ge\; (2^m - 2)/m.$$
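
The count I(n) and the resulting operation counts are easy to check numerically. The Python sketch below (function names are illustrative) counts the conjugacy classes $\{j, 2j, 4j, \ldots\}$ modulo n directly, compares the result with the divisor sum, and evaluates the addition and multiplication totals stated above.

```python
from math import gcd

def order_of_2(d):
    """Psi(d): multiplicative order of 2 modulo an odd integer d > 1."""
    k, x = 1, 2 % d
    while x != 1:
        x = (2 * x) % d
        k += 1
    return k

def phi(d):
    """Euler's phi function by direct counting (adequate for small d)."""
    return sum(1 for a in range(1, d + 1) if gcd(a, d) == 1)

def count_by_cosets(n):
    """Number of classes {j, 2j, 4j, ...} mod n, excluding the class {0}."""
    seen, count = {0}, 0
    for j in range(1, n):
        if j not in seen:
            count += 1
            while j not in seen:
                seen.add(j)
                j = (2 * j) % n
    return count

def count_by_divisor_sum(n):
    """Sum of phi(d)/Psi(d) over the divisors d > 1 of n."""
    return sum(phi(d) // order_of_2(d) for d in range(2, n + 1) if n % d == 0)

def sfft_op_counts(n):
    """(multiplications, additions) from the totals given in the text."""
    m, I = order_of_2(n), count_by_cosets(n)
    return m * (n - 1 - I) + (n - 1) * (m - 1), m * (n - 1) * (I + 1)

for n in (7, 15, 31, 63):
    print(n, count_by_cosets(n), count_by_divisor_sum(n), sfft_op_counts(n))
```

For n = 31 (m = 5) this gives I(n) = 6, 240 multiplications and 1050 additions, in agreement with $2(n-1)(m-1)$ and $(n-1)(n+m-1)$.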
III. Analysis of Bit Complexity

We now analyze the S-FFT algorithm, the FFT algorithm and direct
computation of the Fourier Transform to determine the numbers of bit operations
required for each. We consider an implementation consisting of a large
combinational network with nm inputs representing the $A_{i,k}$ and nm outputs
representing the bits of the $B_j$, and count the number of gates required to
build such a network. For simplicity, we consider the case $n = 2^m - 1$ only
and also take $\alpha = \beta$.

For direct computation of the Fourier Transform, the $B_j$'s,
$j = 0, 1, \ldots, n-1$, can be computed by Horner's rule as

$$B_j = (\cdots((A_{n-1}\alpha^j + A_{n-2})\alpha^j + A_{n-3})\alpha^j + \cdots + A_1)\alpha^j + A_0.$$

For each $j = 0, 1, \ldots, n-1$, we need (n-1) multipliers capable of
multiplying their inputs by $\alpha^j$. These multipliers can be implemented
as linear transformations of the m-bit vector inputs (e.g. [7], p. 45). The
rows of the m × m matrix of such a transformation are the m-bit
representations (as in (1)) of the elements
$\alpha^{j+m-1}, \alpha^{j+m-2}, \ldots, \alpha^{j}$. If there are t ones in
the matrix, the linear transformation can be implemented with t-m
exclusive-OR (XOR) gates. Note that multiplication by 1 corresponds to the
identity transformation and thus requires no XOR gates. Consider the n
m × m matrices corresponding to multiplication by
$1, \alpha, \ldots, \alpha^{n-1}$. These elements are exactly the n nonzero
binary m-tuples, and each nonzero binary m-tuple occurs as a row in m
different matrices. Hence, the n matrices contain a total of $m^2 2^{m-1}$
ones and $(2^m-2)\bigl(m^2 2^{m-1} - m(2^m-1)\bigr)$ XOR gates are required
to implement all the multiplications.
Since $(2^m-1)(2^m-2)$ additions, each requiring m XOR gates to implement,
are also required, we see that a total of $m^2 2^m (2^{m-1}-1)$ XOR gates is
required to compute the Fourier Transform by the direct method. The total
delay through the network is $(2^m-2)(\lceil \log_2 m \rceil + 1)$ gate delays.
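
The gate count for the direct method can be reproduced by actually building the multiplication matrices and counting their ones. The Python sketch below assumes a polynomial basis in $\alpha$, with $\alpha$ a root of the primitive polynomial `poly`; the function names and the choice of polynomials are illustrative.

```python
def powers_of_alpha(poly, m):
    """List [1, alpha, ..., alpha^(n-1)] as m-bit integers, n = 2^m - 1."""
    n, out, x = (1 << m) - 1, [], 1
    for _ in range(n):
        out.append(x)
        x <<= 1
        if (x >> m) & 1:
            x ^= poly                      # reduce modulo the field polynomial
    return out

def total_ones(poly, m):
    """Total number of ones in the n matrices for multiplication by alpha^j."""
    n, pw = (1 << m) - 1, powers_of_alpha(poly, m)
    # The matrix for alpha^j has rows alpha^j, ..., alpha^(j+m-1).
    return sum(bin(pw[(j + r) % n]).count("1") for j in range(n) for r in range(m))

def direct_xor_gates(poly, m):
    """XOR gates for the direct method: (n-1) copies of each multiplier plus adders."""
    n = (1 << m) - 1
    mult_gates = (n - 1) * (total_ones(poly, m) - m * n)   # t - m gates per matrix
    add_gates = n * (n - 1) * m                            # m gates per addition
    return mult_gates + add_gates

print(total_ones(0b1011, 3), direct_xor_gates(0b1011, 3))        # x^3 + x + 1
print(total_ones(0b100101, 5), direct_xor_gates(0b100101, 5))    # x^5 + x^2 + 1
```

The totals agree with $m^2 2^{m-1}$ ones and $m^2 2^m (2^{m-1}-1)$ gates: 36 and 216 for m = 3, 400 and 12,000 for m = 5.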

Another possibility in the direct method is to compute $B_j$ by
multiplying $A_i$ by $\alpha^{ij}$ simultaneously for $i = 0, 1, \ldots, n-1$
and summing the products. When n is a prime, the hardware requirements are
exactly the same as before. When n is not a prime, the hardware requirements
actually decrease very slightly. This is because whenever j and n are not
relatively prime, the set of elements
$1, \alpha^j, \alpha^{2j}, \ldots, \alpha^{(n-1)j}$ used in computing $B_j$
is not the same as the set of elements $1, \alpha, \ldots, \alpha^{n-1}$. In
fact, if $\alpha^j$ is a primitive d-th root of unity (where d is some
divisor of n), then each of the elements
$1, \alpha^j, \ldots, \alpha^{(d-1)j}$ occurs exactly n/d times in the former
set. As a consequence, the hardware required to compute $B_j$ depends on the
order of the element $\alpha^j$. In column 4 of Table 1, we give the total
hardware required to compute $B_j$ where $\alpha^j$ is a primitive d-th root
of unity for nontrivial divisors d of n. Since there are $\phi(d)$ primitive
d-th roots of unity, the total hardware is easily inferred from these
numbers. The delay through the network is $m + \lceil \log_2 m \rceil$ gate
delays, which is considerably smaller than that required for Horner's rule;
the saving in hardware when n is not a prime is less than 1%.


Turning to the FFT algorithm, let p be the smallest nontrivial
divisor of n and let n = pq. Let $A = [A_0, A_1, A_2, \ldots, A_{n-1}]$ and
$B = [B_0, B_1, \ldots, B_{n-1}]$ and define $M_{n,\alpha}$ to be an n × n
matrix with entries $a_{ij} = \alpha^{ij}$, $i, j = 0, 1, \ldots, n-1$. Then
computing the Fourier Transform is equivalent to computing
$B = A\,M_{n,\alpha}$. If we rearrange the columns of $M_{n,\alpha}$ in the
order $0, p, 2p, \ldots, (q-1)p, 1, 1+p, 1+2p, \ldots, 1+(q-1)p, \ldots,
(p-1), (p-1)+p, \ldots, (p-1)+(q-1)p$ to get the new matrix $M'_{n,\alpha}$,
then $A\,M'_{n,\alpha}$ is simply B with its elements permuted into the same
order. However, $M'_{n,\alpha}$ factors into two matrices X and Y where X is
a p × p array of q × q diagonal submatrices and Y is a p × p diagonal array
of q × q submatrices. The entries along the diagonal of Y are all identical
and equal to $M_{q,\alpha^p}$. The q × q diagonal submatrix on the i-th row
and j-th column of the p × p array comprising X ($i, j = 0, 1, \ldots, p-1$)
is $\mathrm{diag}(\alpha^{iqj}, \alpha^{(iq+1)j}, \ldots, \alpha^{(iq+q-1)j})$,
as in Figure 1.

Since p is the smallest divisor of n, the elements
$\alpha, \alpha^2, \ldots, \alpha^{p-1}$ are all primitive n-th roots of
unity. It follows that (p-1) multiplications by each of
$1, \alpha, \alpha^2, \ldots, \alpha^{n-1}$ and n(p-1) additions are required
to compute AX. The total hardware for this part is $(p-1)m^2 2^{m-1}$ XOR
gates and the delay is $\lceil \log_2 m \rceil + \lceil \log_2 p \rceil$ gate
delays. Of course, we still have to multiply by Y, but this is equivalent to
computing p transforms of length q. If q is not a prime, let p' be the
smallest nontrivial divisor of q and let q = p'q'. Proceeding with the
factorisation of $M_{q,\alpha^p}$ exactly as before, we get
$M'_{q,\alpha^p} = X'Y'$ etc. To multiply by X' requires (p'-1)
multiplications by each of
$1, \alpha^p, \alpha^{2p}, \ldots, \alpha^{(q-1)p}$ and q(p'-1) additions.
Now $\alpha^p = \gamma$ is a q-th root of unity. The total number of bit
operations required to multiply q elements of $GF(2^m)$ by
$1, \gamma, \gamma^2, \ldots, \gamma^{q-1}$ respectively and sum the results
is given in column 5 of Table 1 (the heading on the column will be explained
later). Hence, we can determine the hardware required to multiply by X'
(remember that p transforms of length q have to be computed), and proceed to
determine the total number of bit operations required to compute the Fourier
Transform using the FFT. If $n = n_1 n_2 \cdots n_s$, the overall delay is at
most $s\lceil \log_2 m \rceil + \lceil \log_2 n_1 \rceil + \cdots +
\lceil \log_2 n_s \rceil$ gate delays. It is also easy to show that a total
of $n(n_1 + n_2 + \cdots + n_s - s)$ additions and
$n(n_1 + n_2 + \cdots + n_s - s - 1) + 1$ multiplications (by quantities
other than 1) are required. However, in order to count bit operations, we
have to consider the orders of the elements involved.

Table 1: Numbers of bit operations to compute $B_j$ and $B_j^{(k)}$ where
$\alpha^j$ is a primitive d-th root of unity. Column 4: number of bit
operations to compute $B_j$. Column 5: number of bit operations to compute
$B_j$ after folding. Column 6: number of bit operations to compute
$B_j^{(k)}$. Column 7: number of bit operations to compute $B_j^{(k)}$ after
folding. [Table entries not legible in this copy.]

Figure 1. The FFT matrix X. I is a q × q identity matrix. All submatrices
of X are diagonal submatrices.
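
The factorization into X and Y amounts to a two-stage computation, which can be checked numerically. The Python sketch below is an illustration only: it assumes the index split $i = aq + c$, $j = j_1 + p j_2$, a polynomial-basis representation of $GF(2^m)$, and hypothetical function names. It also evaluates the operation counts quoted above.

```python
def gf_mul(a, b, poly, m):
    """Product of a and b in GF(2^m), elements stored as m-bit integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r

def gf_pow(a, e, poly, m):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a, poly, m)
    return r

def direct_transform(A, alpha, poly, m):
    n = len(A)
    return [_eval(A, gf_pow(alpha, j, poly, m), poly, m) for j in range(n)]

def _eval(coeffs, x, poly, m):
    acc = 0
    for c in reversed(coeffs):             # Horner's rule
        acc = gf_mul(acc, x, poly, m) ^ c
    return acc

def two_stage_transform(A, alpha, p, q, poly, m):
    n = p * q
    # Stage 1 (multiplication by X): T[j1][c] = sum_a A[aq+c] * alpha^((aq+c) j1)
    T = [[0] * q for _ in range(p)]
    for j1 in range(p):
        for c in range(q):
            for a in range(p):
                i = a * q + c
                w = gf_pow(alpha, (i * j1) % n, poly, m)
                T[j1][c] ^= gf_mul(A[i], w, poly, m)
    # Stage 2 (multiplication by Y): p transforms of length q with root alpha^p
    gamma = gf_pow(alpha, p, poly, m)
    B = [0] * n
    for j1 in range(p):
        for j2 in range(q):
            B[j1 + p * j2] = _eval(T[j1], gf_pow(gamma, j2, poly, m), poly, m)
    return B

def fft_op_counts(factors):
    """(multiplications by quantities other than 1, additions) for n = prod(factors)."""
    n = 1
    for f in factors:
        n *= f
    s, total = len(factors), sum(factors)
    return n * (total - s - 1) + 1, n * (total - s)

poly4 = 0b10011                            # x^4 + x + 1, an assumed field polynomial
A = [(3 * i + 1) % 16 for i in range(15)]
print(two_stage_transform(A, 2, 3, 5, poly4, 4) == direct_transform(A, 2, poly4, 4))
print(fft_op_counts([3, 5]))
```

For n = 15 = 3 · 5 the counts are 76 multiplications and 90 additions.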

Let us now consider the S-FFT algorithm. First, let $n = 2^m - 1$ be
a prime. We note that (n-1) additions, requiring a total of m(n-1) XOR gates,
are needed to compute $B_0$. Since
$1, \alpha^j, \alpha^{2j}, \ldots, \alpha^{(n-1)j}$ are the n distinct
nonzero m-tuples, we see that computing

$$B_j^{(k)} = \sum_{i=0}^{n-1} A_{i,k}\,\alpha^{ij}$$

requires a total of $m(2^{m-1}-1)$ XOR gates and a delay of
$\lceil \log_2 2^{m-1} \rceil = (m-1)$ gate delays. Such computations are
required for m values of k and $I(n) = (n-1)/m$ values of j. Now, the
squaring operations in (6) can also be implemented by linear transformations
(e.g., [7], p. 50) as can the multiplications in (5). The numbers of ones in
matrices corresponding to squaring can be made quite small if the minimal
polynomial of $\alpha$ is chosen carefully. The numbers of gates and the
delays in a squaring circuit are given in Table 2.

For example, if m = 5, n = 31, then one can successively compute
$B_{2j}^{(k)}$, $B_{4j}^{(k)}$, $B_{8j}^{(k)}$ and $B_{16j}^{(k)}$ from
$B_j^{(k)}$ using a cascade of four squaring circuits, giving a total of 12
gates and a delay of 4 gate delays. Alternatively, since squaring is a linear
operation, so is taking the 4th, 8th and 16th power of an element. Hence,
$B_j^{(k)}$ can be applied to four different circuits to produce
$B_{2j}^{(k)}$, $B_{4j}^{(k)}$, $B_{8j}^{(k)}$ and $B_{16j}^{(k)}$
simultaneously. It can be shown that this requires 23 gates but the delay is
only 2 gate delays.

                                     Squaring      Multiplier circuits
                                     circuit     Horner's rule   Simultaneous mult.
  m      n    Minimal polynomial    # of         # of            # of     Max.
              of alpha              gates Delay  gates  Delay    gates    delay
  3      7    x^3+x+1                 1     1      2      2        3        1
  4     15    x^4+x+1                 2     1      3      3        6        1
  5     31    x^5+x^2+1               3     1      4      4       11        2
  6     63    x^6+x+1                 3     1      5      5       15        1
  7    127    x^7+x+1                 3     1      6      6       21        1
  8    255    x^8+x^?+x^?+x^?+1      20     2     21      7       86        2
  9    511    x^9+x^4+1               6     1      8      8       42        2
 10   1023    x^10+x^3+1              6     2      9      9       48        2
 11   2047    x^11+x^2+1              6     1     10     10       56        2

Table 2. Numbers of gates and delays required to implement (a) the squaring
operation, (b) the (m-1) multiplications in Horner's rule and (c)
simultaneous multiplications by $1, \alpha, \ldots, \alpha^{m-1}$. Gates and
delays in adder circuits are not included.
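
The gate count of a squaring circuit can be estimated directly from the squaring matrix. The Python sketch below is a rough illustration under several assumptions: a polynomial basis in $\alpha$, no sharing of common subexpressions between output bits, each output bit formed by a balanced tree of two-input XOR gates, and minimal polynomials chosen here for illustration (x^3+x+1, x^5+x^2+1 and x^10+x^3+1).

```python
def squaring_circuit(poly, m):
    """Return (XOR gates, gate delays) of the linear map a -> a^2 in GF(2^m)."""
    # Column i of the squaring matrix is alpha^(2i), since
    # (sum_i a_i alpha^i)^2 = sum_i a_i alpha^(2i) in characteristic 2.
    cols = []
    for i in range(m):
        x = 1
        for _ in range(2 * i):
            x <<= 1
            if (x >> m) & 1:
                x ^= poly                  # reduce modulo the minimal polynomial
        cols.append(x)
    gates, delay = 0, 0
    for bit in range(m):
        fan_in = sum((c >> bit) & 1 for c in cols)     # inputs XORed into this output
        gates += max(fan_in - 1, 0)
        delay = max(delay, (fan_in - 1).bit_length())  # ceil(log2(fan_in))
    return gates, delay

print(squaring_circuit(0b1011, 3))         # x^3 + x + 1
print(squaring_circuit(0b100101, 5))       # x^5 + x^2 + 1
print(squaring_circuit((1 << 10) | 0b1001, 10))   # x^10 + x^3 + 1
```

With these polynomials the sketch reports 1, 3 and 6 gates with delays 1, 1 and 2 respectively; a cascade of four such circuits for m = 5 then uses 12 gates.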


A similar situation exists in computing the $B_j$ from the $B_j^{(k)}$'s.
Using Horner's rule, we can compute $B_j$ from the $B_j^{(k)}$'s using (m-1)
multiplications by $\alpha$ and (m-1) additions. The total number of XOR
gates and delays required to implement all (m-1) multiplications by $\alpha$
are given in Table 2. We also need (m-1) adders, which require m(m-1) XOR
gates and delay the signals by a further (m-1) gate delays. Alternatively,
one can multiply each $B_j^{(k)}$ by $\alpha^k$ ($k = 0, 1, 2, \ldots, m-1$)
simultaneously and add the products. The total number of gates and the
maximum delay are listed in Table 2. As before, (m-1) adders are required but
the increase in delay is only $\lceil \log_2 m \rceil$ gate delays. While
discussing the direct method, we saw that we could decrease the delay without
increasing the hardware by changing to simultaneous multiplications. This is
not true here. Except for the case m = 8, exactly one gate is required to
multiply by $\alpha$, giving a total of (m-1) gates to implement Horner's
rule, while the (m-1) simultaneous multiplications require approximately
$\tfrac{1}{2}m(m-1)$ gates. Even so, these latter multiplications require
very little hardware since the average multiplication by an n-th root
requires $\tfrac{1}{2}m^2 - m$ bit operations, compared to a total of
$\tfrac{1}{2}m(m-1)$ bit operations for (m-1) multiplications. The numbers
of bit operations required to implement the S-FFT algorithm can be determined
from the above.

When n is not a prime, we have, as before, that the hardware required
to compute $B_j^{(k)}$ depends on the order of $\alpha^j$. This is given in
column 6 of Table 1. Using these numbers and Table 2, it is straightforward
to determine the total numbers of bit operations required for the S-FFT
algorithm.
A technique that I call folding can be used to reduce the number of
bit operations at the expense of increased delay for the S-FFT algorithm and
for direct computation, whenever n is not a prime. Suppose
$\alpha^j = \gamma$ is an element of order d where $d \mid n$. Then

$$B_j = \sum_{i=0}^{n-1} A_i\,\gamma^i = \sum_{i=0}^{d-1} \Bigl(\sum_{l=0}^{n/d-1} A_{i+ld}\Bigr)\gamma^i.$$

If we precompute the d inner sums, then d multiplications (by
$1, \gamma, \gamma^2, \ldots, \gamma^{d-1}$) and d-1 additions suffice to
compute $B_j$. I call the process of computing the inner sums a folding of
A(x) to length d and denote the polynomial

$$\sum_{i=0}^{d-1} \Bigl(\sum_{l=0}^{n/d-1} A_{i+ld}\Bigr) x^i \quad \text{by } A(x|d).$$

Thus, we save (n-d) multiplications and (n-d) additions for $\phi(d)$
different values of j but need to first fold A(x), which requires n-d extra
additions. Actually, the savings are even more, as we illustrate by an
example. Suppose n = 63. Then, we need to find A(x|21), A(x|9), A(x|7) and
A(x|3). To compute the first two requires 42 and 54 additions respectively.
To find A(x|7), we can either fold A(x) to length 7 using 56 additions, or
fold A(x|21) to length 7 using only 14 additions. Obviously, the latter is
preferable. Similarly, we can fold A(x|9) to get A(x|3) using only 6
additions. Finally, we can fold A(x|3) to length 1 using 2 additions and this
is exactly the same as computing $B_0$! Consequently, only 42 + 14 extra
additions are required for folding operations since the 54+6+2 additions to
fold A(x) to lengths 9, 3 and 1 are required for computing $B_0$ anyway.
However, we save some multiplications and additions in computing the $B_j$;
the result is a saving of 1284 multiplications and 1228 additions. This is
approximately one-third of the total number of multiplications and additions
required for direct computation when n = 63. The total hardware required to
compute $B_j$ from A(x|d) is given in column 5 of Table 1. The savings due to
folding occur in the S-FFT algorithm also. Thus, after folding A(x) to all
the necessary lengths, we can compute $B_j^{(k)}$ from $A^{(k)}(x|d)$, which
is available directly from A(x|d). The numbers of bit operations required for
this are given in column 7 of Table 1. From this, we can determine the total
numbers of bit operations required for the S-FFT algorithm.
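
Folding is straightforward to express in code. The Python sketch below (illustrative names; $GF(2^6)$ is represented in a polynomial basis with the assumed field polynomial x^6+x+1) folds a length-63 sequence, checks that evaluating the folded polynomial at an element of order d gives the same $B_j$ as evaluating A(x), and recomputes the multiplication savings for n = 63.

```python
from math import gcd

def gf_mul(a, b, poly, m):
    """Product of a and b in GF(2^m), elements stored as m-bit integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r

def gf_pow(a, e, poly, m):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a, poly, m)
    return r

def evaluate(coeffs, x, poly, m):
    """Evaluate sum_i coeffs[i] x^i by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x, poly, m) ^ c
    return acc

def fold(A, d):
    """A(x|d): coefficient i is the sum of A[i], A[i+d], A[i+2d], ..."""
    assert len(A) % d == 0
    out = [0] * d
    for i, a in enumerate(A):
        out[i % d] ^= a                    # addition in GF(2^m) is XOR
    return out

def phi(d):
    return sum(1 for a in range(1, d + 1) if gcd(a, d) == 1)

def multiplications_saved(n):
    """Sum of phi(d) * (n - d) over the nontrivial divisors d of n."""
    return sum(phi(d) * (n - d) for d in range(2, n) if n % d == 0)

poly6, m, n = 0b1000011, 6, 63             # x^6 + x + 1
A = [(5 * i + 1) % 64 for i in range(n)]
for d in (21, 9, 7, 3):
    gamma = gf_pow(2, n // d, poly6, m)    # gamma^d = 1
    print(d, evaluate(A, gamma, poly6, m) == evaluate(fold(A, d), gamma, poly6, m))
print(fold(fold(A, 21), 7) == fold(A, 7))
print(multiplications_saved(63))           # 1284
```

The saving in additions quoted in the text is this figure less the 42 + 14 extra additions spent on folding.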

In Table 3, we give the total numbers of bit operations required for
computing the Fourier Transform by the various algorithms. When n is a prime,
there is no FFT algorithm; when n is not a prime, we give the results both
with and without folding. We notice that, as might be expected, the FFT is
superior to the S-FFT for larger values of n, while at shorter lengths, the
S-FFT with folding is superior to the FFT. The S-FFT algorithm is superior
to direct computation by a factor of (m-1) approximately. Folding reduces
the bit complexity by a factor of $\phi(n)/n$ approximately, i.e., the number
of bit operations to evaluate $B_j$ after folding is quite small compared to
the number required when folding is not used. Finally, the case n = 255 is
exceptional in that the squaring circuits and multiplier circuits require
unusually large amounts of hardware and hence the S-FFT requires more
hardware than the FFT.


    n     Algorithm               Number of bit operations

    7     Direct                              216
          S-FFT                               132

   15     Direct                            1,788
          Direct with folding               1,284
          FFT                                 760
          S-FFT                               778
          S-FFT with folding                  642

   31     Direct                           12,000
          S-FFT                             3,480

   63     Direct                           71,412
          Direct with folding              48,300
          FFT                              11,736
          S-FFT                            16,822
          S-FFT with folding               11,494

  127     Direct                          395,136
          S-FFT                            64,764

  255     Direct                        2,069,788
          Direct with folding           1,292,460
          FFT                             178,336
          S-FFT                           328,518
          S-FFT with folding              219,766

  511     Direct                       10,565,952
          Direct with folding           9,177,126
          FFT                           1,620,288
          S-FFT                         1,259,292
          S-FFT with folding            1,088,706

 1023     Direct                       52,290,464
          Direct with folding          36,222,784
          FFT                           2,103,520
          S-FFT                         5,548,398
          S-FFT with folding            3,798,638

 2047     Direct                      253,453,376
          Direct with folding         240,403,042
          FFT                          13,606,912
          S-FFT                        23,381,336
          S-FFT with folding           22,241,274

Table 3. Numbers of bit operations required to compute the Fourier Transform
by various algorithms.



IV.  Implementations  and  Applications 


The numbers of bit operations required to compute the Fourier
Transform using the various algorithms were discussed at some length in the
previous section. The nm-input, nm-output combinational network discussed
there is not necessarily a practical implementation. Two more realistic
methods are an m-input, nm-output sequential network and an nm-input,
m-output sequential network. In the former, the input to the circuit is the
sequence $A_{n-1}, A_{n-2}, \ldots, A_0$. When the last symbol is received,
the nm outputs are the $B_j$'s. In the latter, the input is
$A_{n-1}, A_{n-2}, \ldots, A_0$, which is stored in the network. The network
then successively generates $B_0, B_1, \ldots, B_{n-1}$. In the coding
literature, these are known as syndrome generating circuits and Chien search
circuits respectively [7, Chapter 5], [9]. Unfortunately, the S-FFT algorithm
is not well-suited to either of these methods. It requires approximately the
same number of flip flops but considerably more XOR gates than the direct
method. I am not aware of any implementation of the FFT algorithm along
these lines. It appears to be even more difficult to implement than the S-FFT.

An alternative implementation is in software with multiplication
being done by means of tables of logarithms and antilogarithms in $GF(2^m)$.
As discussed earlier, the direct method could be implemented using $(n-1)^2$
multiplications and n(n-1) additions while the FFT would require
$n(n_1 + n_2 + \cdots + n_s - s - 1) + 1$ multiplications and
$n(n_1 + n_2 + \cdots + n_s - s)$ additions. For the S-FFT, one can trade off
time for memory space. Thus, one has to compute $B_j^{(k)}$ for m values of k
and I(n) values of j, say $j_1, j_2, \ldots, j_{I(n)}$. One can store an
(n-1) × I(n) array whose i-th row is
$\alpha^{i j_1}, \alpha^{i j_2}, \ldots, \alpha^{i j_{I(n)}}$ and an
m × I(n) array containing the partial sums for $B_j^{(k)}$. Then, after
testing $A_{i,k}$, one either adds (or does not add) the i-th row of the
first array into the k-th row of the second. Furthermore, if one stores two
tables (with $2^m$ entries each) whose $\gamma$-th entries are respectively
$\gamma^2$ and $\gamma\beta$, one can dispense with the log and antilog
tables entirely. As in the hardware implementation, the squarings and
multiplications by $\beta$ in (5) are less complex than multiplications in
general, since the former require only one table lookup while the latter
require two. If one wishes to avoid storing an (n-1) × I(n) array, then one
can compute $\alpha^{i j_1}, \ldots, \alpha^{i j_{I(n)}}$ at the i-th step.
In this case, the S-FFT requires approximately $n^2/m$ extra multiplications.
As a result, it cannot always compete with the FFT when n is not a prime;
when n is a prime, the S-FFT is still better than the direct method.
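
The two lookup tables mentioned above can be sketched as follows. This is an illustration only: it assumes $\beta$ is the root x of the field polynomial (stored as the integer 2), builds the tables with a bitwise multiply for simplicity, and uses hypothetical function names. Once the tables exist, squaring and recombination by (5) need only lookups and XORs.

```python
def gf_mul(a, b, poly, m):
    """Bitwise product in GF(2^m); used here only to build and check the tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r

def build_tables(poly, m, beta=2):
    """Tables whose gamma-th entries are gamma^2 and gamma*beta respectively."""
    size = 1 << m
    square = [gf_mul(g, g, poly, m) for g in range(size)]
    times_beta = [gf_mul(g, beta, poly, m) for g in range(size)]
    return square, times_beta

def conjugates(value, square, count):
    """value, value^2, value^4, ...: one table lookup per squaring."""
    out = [value]
    for _ in range(count - 1):
        out.append(square[out[-1]])
    return out

def combine(slices_j, times_beta):
    """B_j from [B_j^(0), ..., B_j^(m-1)] by Horner's rule in beta, as in (5)."""
    acc = 0
    for v in reversed(slices_j):
        acc = times_beta[acc] ^ v          # one lookup and one XOR per term
    return acc

square, times_beta = build_tables(0b100101, 5)     # assumed polynomial x^5 + x^2 + 1
print(conjugates(2, square, 5))
print(combine([3, 0, 7, 1, 30], times_beta))
```

Each table has $2^m$ entries, so the memory cost is modest for the field sizes considered in Table 2.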

In recent years, considerable attention has been paid to Fast Fourier
Transforms over finite mathematical structures because no round-off or
truncation errors can occur in such transforms. However, the major
application envisioned for the S-FFT algorithm (and, indeed, the initial
impetus for considering it) is in the encoding and decoding of BCH and
Reed-Solomon codes [7]. Mandelbaum [10] has proposed a technique for
implementing Reed-Solomon codes which requires both the encoder and the
decoder to compute a Fourier Transform. It has been shown [11], [12] that if
an FFT is used for these transforms, then Mandelbaum's technique is superior
to the usual implementation for a wide range of rates and block lengths. Use
of the S-FFT increases this range and also allows reasonable implementations
when the block length is a prime.
For the usual implementations, one must compute syndromes, i.e.,
evaluate a polynomial of degree n-1 at $\alpha^j$, $j = 1, 2, \ldots, 2t$,
and locate errors by the Chien search, i.e., find the Fourier Transform of a
polynomial of degree t. For binary BCH codes, the polynomial of degree n-1 is
a polynomial over GF(2) and the S-FFT has no application. Equation (6) is
used to compute some of the syndrome values but this result is well known in
the coding literature. However, fast techniques can be used for syndrome
computation of Reed-Solomon codes and Chien search for both BCH and
Reed-Solomon codes. Here, direct computation requires approximately 2nt
multiplications and additions, and nt multiplications and additions
respectively. The reduction in FFT complexity is much smaller. If one thinks
of the FFT as multiplication of a row vector by a succession of sparse
matrices, then (for the Chien search), a vector with (t+1) nonzero entries
becomes a vector with $n_1(t+1)$ nonzero entries after one matrix
multiplication and a vector with $n_1 n_2 (t+1)$ nonzero entries after the
second matrix multiplication etc., so that the advantages of beginning with a
polynomial of small degree are rapidly lost. On the other hand, for the S-FFT
Chien search, one requires approximately $m\,t\,I(n) + nm + t$ additions and
$2nm$ multiplications. Similar savings are available in syndrome computation
as well. In short, for these specialized cases, direct computation and the
S-FFT have reduced complexity in similar ratios while the FFT has not. This
is an added advantage for the S-FFT algorithm.